Bleg 1: String Distance
Mar 27 2014
String distance measurements are useful for cleaning up the sort of messy data you get when you combine multiple sources.
There are a bunch of string distance algorithms, most of which rely on some calculation about the similarity of characters. But in real life, characters are rarely the relevant units: you want a distance measure that penalizes changes to the most information-laden parts of the text more heavily than changes to the parts that are filler.
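To make the contrast concrete, here is a minimal sketch in Python. It compares plain character-level Levenshtein distance with a token-level distance in which each word carries a weight. Everything specific here is an assumption for illustration: the company names are invented, the `FILLER_WEIGHTS` table is hand-picked, and weighted Jaccard distance is just one simple way to down-weight filler, not a recommendation of any particular algorithm.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance: insert, delete, substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


# Hypothetical weights: words assumed to carry little information get a low
# weight; any word not listed here counts as fully informative (weight 1.0).
FILLER_WEIGHTS = {"company": 0.1, "co": 0.1, "inc": 0.1, "the": 0.1}


def weighted_token_distance(a: str, b: str, weights=FILLER_WEIGHTS) -> float:
    """1 minus the weighted Jaccard similarity of the two lowercase token sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())

    def total(tokens):
        return sum(weights.get(tok, 1.0) for tok in tokens)

    union = total(tokens_a | tokens_b)
    if union == 0:
        return 0.0  # both strings are empty
    return 1.0 - total(tokens_a & tokens_b) / union


same_firm = ("Acme Widget Company", "Acme Widget Co")
different_firm = ("Acme Widget Company", "Apex Widget Company")

char_same = levenshtein(*same_firm)
char_diff = levenshtein(*different_firm)
tok_same = weighted_token_distance(*same_firm)
tok_diff = weighted_token_distance(*different_firm)

print(char_same, char_diff)                    # character-level distances
print(round(tok_same, 3), round(tok_diff, 3))  # weighted token distances
```

On these made-up names, the character-level measure rates "Apex Widget Company" as closer to "Acme Widget Company" than "Acme Widget Co" is, because abbreviating the filler word costs more character edits than swapping the one word that actually identifies the firm. The weighted token measure reverses that ordering. The hard part is the weights themselves: hand-picking them does not scale, which is why something learned from the data would be preferable.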